🎉 Destination Bigquery: added gcs upload option #5614
Conversation
…igquery (GCS upload mode)
…ead of hardcoded and minor refactor
/test connector=connectors/destination-bigquery
@etsybaev While looking for a workaround for the BigQuery connector failing with large datasets, I first tried manually uploading our tables to GCS, but soon hit a limitation with the current GCS implementation as described in #5720. Since I'm assuming that you're reusing the same logic for the GCS writer, I just wanted to let you know that this might also impact this PR.
Nice pr!
There is a major problem that I don't understand: why are we using the Amazon S3 client everywhere if we are trying to load data via GCS?
@@ -1,3 +1,13 @@
## Uploading options
There are 2 available options to upload data to bigquery `Standard` and `GCS Staging`.
- `Standard` is option to upload data directly from your source to BigQuery storage. This way is faster and requires less resources than GCS one.
I think we need to explain further when to choose which option. It's not clear to me when I should choose Standard vs GCS Uploading (CSV format).
Updated, thanks
    throws IOException {
  Timestamp uploadTimestamp = new Timestamp(System.currentTimeMillis());
  AmazonS3 s3Client = GcsS3Helper.getGcsS3Client(gcsDestinationConfig);
Why are we creating an AmazonS3 client if we are using Google Cloud Storage for loading data?
GCS is compatible with the Amazon S3 client. By reusing the S3 client, we can reuse the related code.
Hi @subodh1810. I partially re-used the existing destination-gcs module, which had been created before I started working on this ticket. I had a conversation with @tuliren and he confirmed that the S3 client is also used for destination-gcs, as it works for both in most cases, except for some minor ones.
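Editor's note: the interoperability works because GCS exposes an S3-compatible XML API. Below is a minimal, illustrative sketch of pointing the AWS S3 client at that endpoint with GCS HMAC credentials; the class and method names (GcsViaS3ClientSketch, buildGcsCompatibleClient) are hypothetical and this is not the code in this PR.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class GcsViaS3ClientSketch {

  // Builds an AmazonS3 client that talks to GCS's S3-compatible endpoint
  // using a GCS HMAC key pair instead of AWS credentials.
  public static AmazonS3 buildGcsCompatibleClient(String hmacKeyId, String hmacKeySecret, String region) {
    BasicAWSCredentials credentials = new BasicAWSCredentials(hmacKeyId, hmacKeySecret);
    return AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("https://storage.googleapis.com", region))
        .withCredentials(new AWSStaticCredentialsProvider(credentials))
        .build();
  }
}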
GcsDestinationConfig gcsDestinationConfig = GcsDestinationConfig
    .getGcsDestinationConfig(BigQueryUtils.getGcsJsonNodeConfig(config));
GcsCsvWriter gcsCsvWriter = initGcsWriter(gcsDestinationConfig, configStream);
gcsCsvWriter.initialize();
In the initialize method we are using the AmazonS3 client, and I am not sure I follow why that is.
Hi @subodh1810. The answer here is similar to the previous comment. I partially re-used the already existing destination-gcs module (the GCS destination CSV writer in particular). As far as I understood, destination-gcs itself had been implemented by re-using existing modules from destination-s3, because the Amazon S3 client works for both GCS and S3 storage in most cases. It's better to ask @tuliren for details.
So the idea is the following (see the sketch below):
- Re-using the existing destination-gcs CSV writer, we upload the data to a GCS bucket.
- Using the native BigQuery writer, we create a native load job to move the CSV-formatted data from the bucket to BigQuery.
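Editor's note: a minimal, illustrative sketch of the second step, assuming the google-cloud-bigquery client. It is not the PR's actual code; the class name and the dataset/table/URI parameters are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsToBigQueryLoadSketch {

  public static void loadCsvFromGcs(String datasetName, String tableName, String gcsCsvUri)
      throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId tableId = TableId.of(datasetName, tableName);

    // Configure a native load job that reads CSV files staged in GCS and appends them to the table.
    LoadJobConfiguration loadConfig = LoadJobConfiguration.newBuilder(tableId, gcsCsvUri)
        .setFormatOptions(FormatOptions.csv())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();

    Job job = bigquery.create(JobInfo.of(loadConfig));
    Job completedJob = job.waitFor();
    if (completedJob == null || completedJob.getStatus().getError() != null) {
      throw new RuntimeException("GCS -> BigQuery load job failed: "
          + (completedJob == null ? "job no longer exists" : completedJob.getStatus().getError()));
    }
  }
}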
Can you just add a Javadoc explaining that the Amazon S3 client works with GCS as well?
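Editor's note: a Javadoc along these lines could cover the request. This is suggested wording only, not text from the PR; attach it wherever the GCS-flavoured AmazonS3 client is actually created (e.g. GcsS3Helper.getGcsS3Client).

/**
 * Builds an {@link com.amazonaws.services.s3.AmazonS3} client configured for Google Cloud Storage.
 *
 * <p>GCS exposes an S3-compatible XML API, so the AWS S3 client (authenticated with GCS HMAC
 * keys) is reused here instead of introducing a separate GCS-specific client. This lets the
 * GCS staging path share the existing destination-s3 / destination-gcs code.
 */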
/test connector=connectors/destination-gcs
…-destination # Conflicts: # settings.gradle
/test connector=connectors/destination-bigquery
/test connector=connectors/destination-gcs
/test connector=connectors/destination-bigquery
/publish connector=connectors/destination-bigquery
* Fixed destination bigquery denormalized compilation error (caused by #5614)
    Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.TIMESTAMP));
// GCS CSV loading supports only certain date/datetime formats, so keep this as a string for now
// https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#data_types
    Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.STRING),
The type of the column was changed from TIMESTAMP to STRING when creating the table schema, but when formatting the records to insert, they are still typed as timestamp?
Line 153 in 888e9ab
final String formattedEmittedAt = QueryParameterValue.timestamp(emittedAtMicroseconds).getValue();
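Editor's note: for context, QueryParameterValue.timestamp(...) renders the microsecond epoch value as a canonical timestamp string, and with the emitted_at column now declared as STRING that text is loaded verbatim rather than parsed as a TIMESTAMP. A small illustrative snippet (the class name is a placeholder; only the quoted call itself comes from the PR):

import com.google.cloud.bigquery.QueryParameterValue;

public class EmittedAtFormattingSketch {

  public static void main(String[] args) {
    long emittedAtMicroseconds = System.currentTimeMillis() * 1000L;
    // Produces a canonical timestamp string, e.g. "2021-09-09 12:41:35.220000+00:00".
    String formattedEmittedAt = QueryParameterValue.timestamp(emittedAtMicroseconds).getValue();
    System.out.println(formattedEmittedAt);
  }
}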
OK, so there was an issue filed for this: #5959
@@ -82,6 +82,10 @@ if(!System.getenv().containsKey("SUB_BUILD") || System.getenv().get("SUB_BUILD")
include ':airbyte-integrations:connectors:destination-redshift'
include ':airbyte-integrations:connectors:destination-snowflake'
include ':airbyte-integrations:connectors:destination-oracle'

//Needed by destination-bugquery
I wish we had a destination-bigquery instead though... 😜
What
Currently, you may see failures for big datasets and slow sources, e.g. if reading from the source takes more than 10-12 hours.
This is caused by limitations of the Google BigQuery SDK client. For more details, please check #3549.
How
There are now two available options to upload data to BigQuery: `Standard` and `GCS Staging`.
- `Standard` uploads data directly from your source to BigQuery storage. This way is faster and requires fewer resources than the GCS one.
- `GCS Uploading (CSV format)` is a newly introduced approach implemented to avoid the issue with big datasets mentioned above. As a first step, all data is uploaded to a GCS bucket, and then it is all moved to BigQuery in one shot, stream by stream.
The destination-gcs connector is partially used under the hood here, so you may check its documentation for more details (a rough sketch of the flow follows below).
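Editor's note: a rough, hypothetical sketch of the per-stream orchestration (stage everything to GCS as CSV, then trigger one load per stream). The interfaces below are placeholders standing in for the real GCS CSV writer and BigQuery loader, not the connector's actual classes.

import java.util.Map;

public class GcsStagingSyncSketch {

  // Hypothetical minimal view of a per-stream CSV writer that stages data in GCS.
  interface StagedCsvWriter {
    void initialize() throws Exception;             // prepare the bucket / open the CSV upload
    void write(String jsonRecord) throws Exception; // append one record to the staged CSV
    String close() throws Exception;                // finish the upload, return the gs:// URI
  }

  // Hypothetical hook that runs a native BigQuery load job for one staged CSV URI.
  interface BigQueryCsvLoader {
    void loadCsvIntoTable(String streamName, String gcsCsvUri) throws Exception;
  }

  // Stage all records to GCS first, then move each stream to BigQuery in a single load job.
  public static void sync(Map<String, StagedCsvWriter> writersByStream,
                          Iterable<Map.Entry<String, String>> records, // (streamName, jsonRecord)
                          BigQueryCsvLoader loader) throws Exception {
    for (StagedCsvWriter writer : writersByStream.values()) {
      writer.initialize();
    }
    for (Map.Entry<String, String> msg : records) {
      writersByStream.get(msg.getKey()).write(msg.getValue());
    }
    for (Map.Entry<String, StagedCsvWriter> entry : writersByStream.entrySet()) {
      String gcsCsvUri = entry.getValue().close();
      loader.loadCsvIntoTable(entry.getKey(), gcsCsvUri);
    }
  }
}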
Pre-merge Checklist
Expand the relevant checklist and delete the others.

New Connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with `airbyte_secret`.
- Integration tests pass locally: `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Documentation updated:
  - the connector's `README.md`
  - `docs/SUMMARY.md`
  - `docs/integrations/<source or destination>/<name>.md`, including the changelog (see changelog example)
  - `docs/integrations/README.md`
  - `airbyte-integrations/builds.md`
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- The `/test connector=connectors/<name>` command is passing.
- The connector is published with the `/publish` command described here.

Updating a connector
Community member or Airbyter
- Secrets in the connector's spec are annotated with `airbyte_secret`.
- Integration tests pass locally: `./gradlew :airbyte-integrations:connectors:<name>:integrationTest`.
- Documentation updated:
  - the connector's `README.md`
  - `docs/integrations/<source or destination>/<name>.md`, including the changelog (see changelog example)
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
- The `/test connector=connectors/<name>` command is passing.
- The connector is published with the `/publish` command described here.

Connector Generator
- The generator test modules (all connectors with `-scaffold` in their name) have been updated with the latest scaffold by running `./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates`, then checking in your changes.